Multi-Agent Learning with Heterogeneous Linear Contextual Bandits
As trained intelligent systems become increasingly pervasive, multi-agent learning has emerged as a popular framework for studying complex interactions between autonomous agents. Yet, a formal understanding of how and when learners in heterogeneous environments benefit from sharing their respective experiences is still in its infancy. In this paper, we seek answers to these questions in the context of linear contextual bandits. We present a novel distributed learning algorithm based on the upper confidence bound (UCB) algorithm, which we refer to as H-LINUCB, wherein agents cooperatively minimize the group regret under the coordination of a central server. In the setting where the level of heterogeneity or dissimilarity across the environments is known to the agents, we show that H-LINUCB is provably optimal in regimes where the tasks are highly similar or highly dissimilar.
A Technical Lemmas
The proof is an induction on $k$. Consider the general case $p_{2k+1}$. It is easy to see that $g'(x) = e^x - p_{2k}(x)$ and $g''(x) = e^x - p_{2k-1}(x)$. By the induction hypothesis, $g'' \ge 0$ and therefore $g$ is convex. Thus, the minimum of $g$ is attained at its stationary points. It is easy to observe that $x = 0$ is indeed a stationary point. Thus, $\min_{x \in \mathbb{R}} g(x) = g(0) = 0$, which finishes the proof.
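The inequality behind this argument can be checked numerically. The sketch below assumes, since the excerpt does not state it, that $p_n$ denotes the degree-$n$ Taylor polynomial of $e^x$ at 0 and that $g(x) = e^x - p_{2k+1}(x)$; the helper names `taylor_exp` and `min_gap` are illustrative only. A grid check is a sanity test, not a substitute for the proof.

```python
import math

def taylor_exp(n, x):
    """Degree-n Taylor polynomial of exp at 0 (assumed meaning of p_n)."""
    term, total = 1.0, 1.0
    for i in range(1, n + 1):
        term *= x / i          # term is now x**i / i!
        total += term
    return total

def min_gap(k, lo=-5.0, hi=5.0, steps=2001):
    """Smallest value of exp(x) - p_{2k+1}(x) on a uniform grid over [lo, hi]."""
    xs = [lo + (hi - lo) * j / (steps - 1) for j in range(steps)]
    return min(math.exp(x) - taylor_exp(2 * k + 1, x) for x in xs)

if __name__ == "__main__":
    for k in range(4):
        # Expected to be zero up to floating-point rounding, attained at x = 0.
        print(k, min_gap(k))
```

Under these assumptions the grid minimum should sit at $x = 0$, matching the stationary point used in the proof.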
Fast Projection onto the Capped Simplex with Applications to Sparse Regression in Bioinformatics
We consider the problem of projecting a vector onto the so-called k-capped simplex, which is a hypercube cut by a hyperplane. For an n-dimensional input vector with bounded elements, we find that a simple algorithm based on Newton's method solves the projection problem to high precision with a complexity of roughly O(n), a much lower computational cost than that of the existing sorting-based methods proposed in the literature. We provide theory that partially explains and justifies the method. We demonstrate that the proposed algorithm produces high-precision solutions of the projection problem on large-scale datasets and significantly outperforms the state-of-the-art methods in terms of runtime (about 6-8 times faster than commercial software with respect to CPU time for input vectors with 1 million variables or more). We further illustrate the effectiveness of the proposed algorithm on sparse regression in a bioinformatics problem. Empirical results on the GWAS dataset (with 1,500,000 single-nucleotide polymorphisms) show that, when the proposed method is used to accelerate the Projected Quasi-Newton (PQN) method, the accelerated PQN algorithm is able to handle huge-scale regression problems and is more efficient (about 3-6 times faster) than the current state-of-the-art methods.
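To make the projection problem concrete, here is a minimal sketch. It is not the Newton-based algorithm of the paper: it uses plain bisection on the scalar shift $\gamma$ in the optimality condition $x_i = \min(1, \max(0, y_i - \gamma))$, $\sum_i x_i = k$, which costs O(n) per iteration. The function name `project_capped_simplex` and its parameters are illustrative assumptions.

```python
def project_capped_simplex(y, k, tol=1e-10, max_iter=200):
    """Euclidean projection of y onto {x : 0 <= x_i <= 1, sum(x) = k}.

    Bisection stand-in for the paper's Newton iteration (assumption:
    the optimum has the form clip(y - gamma, 0, 1) for a scalar gamma).
    """
    n = len(y)
    if not 0 <= k <= n:
        raise ValueError("k must lie in [0, n]")

    def total(gamma):
        # Sum of the clipped, shifted entries; nonincreasing in gamma.
        return sum(min(1.0, max(0.0, v - gamma)) for v in y)

    lo = min(y) - 1.0  # every entry clips to 1, so total(lo) = n >= k
    hi = max(y)        # every entry clips to 0, so total(hi) = 0 <= k
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if total(mid) > k:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    gamma = 0.5 * (lo + hi)
    return [min(1.0, max(0.0, v - gamma)) for v in y]

if __name__ == "__main__":
    print(project_capped_simplex([0.9, 0.5, 0.1, -0.3], 2))
```

Bisection is chosen here only because it is easy to verify; a Newton-type update on the same scalar equation is what the abstract describes.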
Supplementary materials
A On the Definition of $\mathrm{LOT}_{r,c}$
Let $(X, d_X)$ and $(Y, d_Y)$ be two nonempty compact Polish spaces, $\mu \in \mathcal{M}_1^+(X)$ and $\nu \in \mathcal{M}_1^+(Y)$ two probability measures on these spaces, and $c: X \times Y \to \mathbb{R}_+$ a nonnegative and continuous function. As $X$ and $Y$ are compact, $\Pi_r(\mu, \nu)$ is tight, so Prokhorov's theorem applies and the closure of $\Pi_r(\mu, \nu)$ is sequentially compact. Let us now show that $\Pi_r(\mu, \nu)$ is closed. Indeed, let $(\pi_n)_{n \ge 0}$ be a sequence in $\Pi_r(\mu, \nu)$ converging towards $\pi$. In addition, as the $(\lambda_n)_{n \ge 0}$ live in the simplex $\Delta_r$, we can also extract a subsequence such that $\lambda_n \to \lambda \in \Delta_r$.
Contents of Appendix
Bayes-consistency only holds for the full family of measurable functions, which of course is distinct from the more restricted hypothesis set used by a learning algorithm. Therefore, a hypothesis set-dependent notion of H-consistency has been proposed by Long and Servedio (2013) in the realizable setting, used by Zhang and Agarwal (2020) for linear models, and generalized by Kuznetsov et al. (2014) to the structured prediction case. Long and Servedio (2013) showed that there exists a case where a Bayes-consistent loss is not H-consistent while inconsistent losses can be H-consistent. Zhang and Agarwal (2020) further investigated the phenomenon in (Long and Servedio, 2013) and showed that the situation of losses that are not H-consistent with linear models can be remedied by carefully choosing a larger piecewise linear hypothesis set. Kuznetsov et al. (2014) proved positive results for the H-consistency of several multi-class ensemble algorithms, as an extension of H-consistency results in (Long and Servedio, 2013). Recently, the notions of H-calibration and H-consistency have been used by Bao et al. (2020); Awasthi et al. (2021a) in the study of adversarial binary classification losses, as defined in (Goodfellow et al., 2014; Madry et al., 2017; Tsipras et al., 2018; Carlini and Wagner, 2017; Awasthi et al., 2023).
Spectral bandits for smooth graph functions
Valko, Michal, Munos, Rémi, Kveton, Branislav, Kocák, Tomáš
Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this paper, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each item we can recommend is a node and its expected rating is similar to those of its neighbors. The goal is to recommend items that have high expected ratings. We aim for algorithms whose cumulative regret with respect to the optimal policy does not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose two algorithms for solving our problem that scale linearly and sublinearly in this dimension. Our experiments on a real-world content recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens of node evaluations.
Supplementary Materials: Semi-Supervised Contrastive Learning for Deep Regression with Ordinal Rankings from Spectral Seriation
The main result is presented in Theorem 2. According to the definition of the Fiedler vector, we have $(L + \Delta L)(f + \Delta f) = (\lambda + \Delta\lambda)(f + \Delta f)$. We outline the proof below for interested readers. We first present Stewart's theorem in Lemma 1 to assist

Actual times may differ depending on hardware and environment. We also show the number of model parameters required for each method in Table S3. Hyper-parameters were selected based on a coarse search on the validation set.
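For readers unfamiliar with the Fiedler vector used in the perturbation argument, the following sketch approximates it for a small graph and orders the nodes by its entries, which is the basic idea of spectral seriation. It assumes the unnormalized Laplacian $L = D - A$ of a connected graph with a simple second-smallest eigenvalue, and it uses deflated power iteration rather than whatever solver the authors used; `fiedler_vector` and `seriation_order` are hypothetical helper names.

```python
def fiedler_vector(adj, iters=500):
    """Approximate the Fiedler vector of L = D - A.

    Power iteration on (c*I - L), with the iterate kept orthogonal to the
    constant vector (the eigenvector of L for eigenvalue 0).
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    c = 2.0 * max(deg)  # Gershgorin bound on the largest eigenvalue of L

    def center_and_normalize(v):
        mean = sum(v) / n
        v = [x - mean for x in v]
        norm = sum(x * x for x in v) ** 0.5
        return [x / norm for x in v]

    # Arbitrary deterministic start; assumed not orthogonal to the target.
    f = center_and_normalize([float(i) for i in range(n)])
    for _ in range(iters):
        # (c*I - L) f = c*f - D f + A f
        g = [c * f[i] - deg[i] * f[i] + sum(adj[i][j] * f[j] for j in range(n))
             for i in range(n)]
        f = center_and_normalize(g)
    return f

def seriation_order(adj):
    """Order nodes by their Fiedler-vector entries (sign is arbitrary)."""
    f = fiedler_vector(adj)
    return sorted(range(len(adj)), key=lambda i: f[i])

if __name__ == "__main__":
    # Path graph whose true ordering is 2 - 0 - 3 - 1 (illustrative data).
    adj = [[0, 0, 1, 1],
           [0, 0, 0, 1],
           [1, 0, 0, 0],
           [1, 1, 0, 0]]
    print(seriation_order(adj))
```

Because an eigenvector is defined only up to sign, the recovered ordering may come out reversed; either direction is a valid seriation.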